This report summarizes COVID-19 cases and deaths over time, by region, and by transmission type using plots, and analyzes the relationship between the number of confirmed COVID-19 cases and several variables of interest, such as the prevalence of underlying diseases. It also summarizes the results of our statistical modeling and checks the model assumptions we made. Our aim is to investigate association and possible causal relationships between the COVID-19 infection rate and other factors.
The coronavirus that is ravaging the world typically starts with fever, fatigue, and a dry cough. More severe cases involve trouble breathing after about one week. Older people and those with underlying medical problems such as chronic kidney disease, tuberculosis, or coronary heart disease are more likely to develop serious illness. Only a fraction of infected people develop obvious symptoms, and people who already have other disorders or chronic diseases are more likely to develop severe symptoms or even die after infection. Although the group with the most severe symptoms is usually the elderly or people with underlying diseases such as heart disease, some of those who have died from COVID-19 were healthy or even relatively young.
The impact of the outbreak on developed countries is enormous. COVID-19 outbreaks appear to be concentrated in economically developed countries and regions, while outbreaks in less economically developed countries and regions appear less severe. This is plausible because developed countries and regions tend to have a particularly high degree of urbanization, which means higher population density.
These patterns may vary from region to region, so we study the differences in COVID-19 infection rates between developed and developing countries and explore the impact of five health factors on COVID-19 infections: chronic kidney disease, diabetes, tuberculosis, obesity rate, and smoking rate. We conducted a series of analyses based on data obtained from the WHO.
The outbreak of coronavirus disease began at the end of 2019, and the United States now has the largest number of confirmed cases. Coronavirus Disease 2019 (COVID-19) is an infectious disease caused by a newly discovered coronavirus, SARS-CoV-2. The virus can cause mild to severe respiratory illness, and most people recover without requiring special treatment. Older people and those with underlying medical problems such as cardiovascular disease, diabetes, chronic respiratory disease, and cancer are more likely to develop serious illness. There are already many studies of COVID-19. We use two data sets in our analysis. The first is from the World Health Organization (https://covid19.who.int/WHO-COVID-19-global-data.csv) and contains the date, country, region, new cases, new deaths, cumulative cases, and cumulative deaths. The second merges data from the WHO, Wikipedia, and several other data sets. It contains the country, region, cumulative cases and deaths per 100,000 people, and country-level rates of underlying diseases such as TB, kidney disease, and diabetes, along with the smoking rate and obesity rate.
In this section, we present summary statistics for the two data sets. The first is the WHO data set, which gives daily and cumulative COVID-19 cases and deaths by date and country. The second contains cumulative COVID-19 cases and deaths together with the other variables we are interested in.
To get a sense of how the infection rate differs across regions and countries, we look at the map below, which shades each country in blue according to its number of COVID-19 cases. The United States of America (USA) has the largest number of COVID-19 cases, more than 20,000,000. Brazil, India, and some countries in Europe also have large numbers of cases. The map shows the cumulative number of cases in countries around the world and suggests that case counts may be related to country and region.
Next, we look for trends in the number of COVID-19 cases and deaths over time.
We plot two figures to compare the cumulative counts with the daily new reports. As time goes by, cumulative cases and cumulative deaths keep increasing, with no turning points or fluctuation. Daily new cases and new deaths, in contrast, first trended upward, reached a peak near January 2021, and then began to decline. One interesting feature is a sharp drop in new COVID-19 cases on December 27, 2020, which lasted only two or three days. On December 30, the number increased sharply and returned to almost the same level as on December 26. Overall, from October 2020 to January 2021, the number of new COVID-19 cases per day was six to seven times higher than in April to July 2020 and twice as high as in July to October 2020. We also want to find the trend in the mortality rate among people who have contracted COVID-19.
Similarly, we compare the cumulative mortality rate and the daily mortality rate. From the plot, we can see that the daily mortality rate varies more than the cumulative mortality rate, and both have peaks. The daily mortality rate has its global maximum between February and March 2020 and a local maximum between April and May 2020. The cumulative mortality rate also reaches its maximum around April to May 2020. The mortality rate is low from around July 2020 to January 2021, possibly because daily new cases increased sharply while deaths remained relatively stable.
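The report does not state exactly how the two mortality rates were computed. The following is a minimal sketch in R under the assumption that the daily rate is new deaths divided by new cases on the same day, and the cumulative rate is the running total of deaths divided by the running total of cases. The column names follow the WHO CSV; the numbers are made up for illustration.

```r
# Toy data using WHO-style column names (values are invented, not real counts).
who <- data.frame(
  Date_reported = as.Date(c("2020-03-01", "2020-03-01", "2020-03-02", "2020-03-02")),
  Country       = c("A", "B", "A", "B"),
  New_cases     = c(100, 50, 200, 150),
  New_deaths    = c(5, 1, 4, 2)
)

# Sum over countries to get global daily totals.
daily <- aggregate(cbind(New_cases, New_deaths) ~ Date_reported, data = who, FUN = sum)
daily <- daily[order(daily$Date_reported), ]

# Assumed daily rate: deaths reported that day divided by cases reported that day.
daily$daily_mortality <- daily$New_deaths / daily$New_cases

# Assumed cumulative rate: running deaths divided by running cases.
daily$cum_mortality <- cumsum(daily$New_deaths) / cumsum(daily$New_cases)

print(daily)
```

With this definition the daily rate is unstable when few cases are reported on a day, which would be consistent with the larger variation seen in the daily series.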
We have analyzed new and cumulative deaths and cases from the WHO data set. We are now interested in the impact of the coronavirus in developed and developing countries, and in the effect of five health factors on COVID-19 infection rates.
We built a new data set starting from the WHO website. After cleaning, it contains 109 countries with their region, cumulative cases per 100,000 population, and transmission type. We also added chronic kidney disease, smoking rate, diabetes rate, tuberculosis rate, obesity rate, and an indicator for developed countries, obtained from the WHO or Wikipedia.
The plot above shows cumulative cases per 100,000 in six regions by mode of transmission. Clusters of cases and community transmission account for most cases. Community transmission dominates, especially in Africa, the Americas, and Europe. Clusters of cases dominate in the Eastern Mediterranean and South-East Asia.
To analyze our data in more detail, we divided each of the five health factors into four levels, ordered from lowest to highest rate.
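The report does not say where the cut points for the four levels were placed. One common choice, shown here as an assumption only, is to cut at the quartiles so that each level holds roughly a quarter of the countries. The data and the level labels below are hypothetical.

```r
set.seed(1)
# Hypothetical obesity rates (%) for 109 countries; simulated, not the report's data.
obesity <- runif(109, min = 2, max = 40)

# Cut points at the 0%, 25%, 50%, 75% and 100% quantiles give four ordered levels.
breaks <- quantile(obesity, probs = seq(0, 1, by = 0.25))
obesity_level <- cut(obesity, breaks = breaks, include.lowest = TRUE,
                     labels = c("low", "mid-low", "mid-high", "high"),
                     ordered_result = TRUE)

# Number of countries in each level.
table(obesity_level)
```

A box plot of the response by level could then be drawn with `boxplot(y ~ obesity_level)`, where `y` stands for cumulative cases per 100,000.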
The five box plots above show the relationship between the rate of each health factor and the COVID-19 infection rate. The medians appear to differ across groups. Cumulative cases per 100,000 decrease as the incidence of chronic kidney disease and tuberculosis increases, and increase as the smoking rate and obesity rate increase. As the prevalence of diabetes increases, cumulative cases per 100,000 first increase and then decrease.
This plot shows the distribution of cumulative COVID-19 cases per 100,000 in developed and developing countries. COVID-19 spreads more in developing countries when the infection rate is below 25,000 per 100,000. Once the infection rate reaches 25,000 per 100,000, COVID-19 is more severe in developed countries than in developing countries.
\(Y^{1/4} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_3^2 + \beta_8 X_5^2 + \epsilon\), where \(\epsilon\) is an error term that follows a normal distribution.
Y: numeric. The total diagnostic rate of COVID-19 in each country as of 02/28/2021 [1].
X1: numeric. The incidence of chronic kidney disease (CKD) [2].
X2: numeric. The proportion of the population who smoke in 2019 [3].
X3: numeric. The incidence of diabetes in 2019 [4].
X4: numeric. The incidence of tuberculosis in 2019 [5].
X5: numeric. The proportion of the population that is obese in 2016 (derived from The World Factbook by the Central Intelligence Agency [6]).
X6: binary. Indicator of whether the country is developed or developing in 2021 (developed if the Human Development Index is greater than 80) [5].
The model requires four assumptions:
1. Linearity: there is a linear relationship between the response variable and the independent variables.
2. Normality: the residuals are normally distributed.
3. No or little multicollinearity: the independent variables are not highly correlated with each other.
4. Homoscedasticity: the variance of the error terms is similar across the values of the independent variables.
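The report checks linearity, normality, and homoscedasticity with diagnostic plots but does not show a multicollinearity check. One possible check, sketched here in base R on simulated data (not the report's data), is the variance inflation factor, computed by regressing each covariate on the others. Values close to 1 indicate little collinearity.

```r
set.seed(2)
n <- 109
# Simulated covariates standing in for the report's variables (values are invented).
covs <- data.frame(
  kidneys  = rnorm(n, 10, 3),
  smoke    = rnorm(n, 20, 8),
  diabetes = rnorm(n, 8, 3),
  TB       = rexp(n, 10),
  Obesity  = rnorm(n, 20, 8)
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing covariate j on the others.
vif <- sapply(names(covs), function(v) {
  fit <- lm(reformulate(setdiff(names(covs), v), response = v), data = covs)
  1 / (1 - summary(fit)$r.squared)
})

round(vif, 2)
```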
We included five covariates representing the proportion or incidence of population health conditions, because we think that higher or lower levels of these factors may be related to how the COVID-19 virus spreads in a country. For example, tuberculosis is an infectious (bacterial) respiratory disease and is in that respect similar to COVID-19, so we expect a positive correlation. The two second-degree polynomial terms are included in the model because the residual plot shows a non-linear relationship when no polynomial terms are included.
Moreover, to study the causal relationship between the total infection rate of COVID-19 and whether a country is developed, we use propensity scores to reduce, to some degree, the selection bias in our binary treatment (X6), and then refit the model on the weighted data. The relationship between the treatment and each covariate is explored by simple linear regression.
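The report does not describe how the propensity scores or the weights were estimated. A common approach, shown here purely as an assumption, is to estimate the probability of treatment with logistic regression and then use inverse probability weights. The data below are simulated, and only the variable names mirror the report.

```r
set.seed(3)
n <- 109
# Simulated stand-in data; variable names mirror the report, values do not.
mydata <- data.frame(Obesity = rnorm(n, 20, 8), kidneys = rnorm(n, 10, 3))
p_true <- plogis(-1 + 0.08 * (mydata$Obesity - 20))
mydata$DevelopedCT <- factor(ifelse(runif(n) < p_true, "YES", "NO"),
                             levels = c("NO", "YES"))

# Propensity score: estimated probability of being "developed" given the covariates.
ps_fit <- glm(DevelopedCT ~ Obesity + kidneys, family = binomial, data = mydata)
mydata$ps <- fitted(ps_fit)

# Inverse probability weights: 1/ps for treated units, 1/(1 - ps) for controls.
treated <- mydata$DevelopedCT == "YES"
mydata$weight <- ifelse(treated, 1 / mydata$ps, 1 / (1 - mydata$ps))

summary(mydata$weight)
```

The resulting `weight` column can then be passed to `lm()` through its `weights` argument. Very large weights signal propensity scores near 0 or 1 and are worth inspecting before refitting.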
First, we fit a multiple linear regression model to investigate the relationship between the COVID-19 diagnostic rate, DevelopedCT, and the disease incidence variables. However, the residuals vs. fitted plot shows a non-linear relationship between the response and the explanatory variables, so we consider a polynomial model. In addition, the points in this plot are not evenly spread around the line, so we transform the response variable by taking its fourth root in order to satisfy the homoscedasticity assumption.
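As a sketch of the model described above, the following fits the same formula on simulated data. The variable names are taken from the report's model call, but every value is invented, so the estimates will not match the report.

```r
set.seed(4)
n <- 109
# Simulated stand-in data (values are invented for illustration).
mydata <- data.frame(
  kidneys = rnorm(n, 10, 3), smoke = rnorm(n, 20, 8),
  diabetes = runif(n, 2, 18), TB = rexp(n, 10), Obesity = runif(n, 2, 40),
  DevelopedCT = factor(sample(c("NO", "YES"), n, replace = TRUE))
)
# Positive simulated response, so the fourth root is well defined.
mydata$inferate <- (0.5 + 0.03 * mydata$Obesity + rnorm(n, sd = 0.1))^4

# Fourth-root response with quadratic terms for diabetes and obesity.
lm1 <- lm(inferate^(1/4) ~ kidneys + smoke + diabetes + TB + Obesity +
            DevelopedCT + I(diabetes^2) + I(Obesity^2), data = mydata)

# Fitted values are on the fourth-root scale; the 4th power returns to the original scale.
head(fitted(lm1)^4)
```

Calling `plot(lm1, which = 1:2)` draws the residuals vs. fitted plot and the normal Q-Q plot used for the diagnostics.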
The summary output of our model lm1 is shown below. From the output, we can see that diabetes, Obesity, and DevelopedCTYES are significant at the 0.01 level, which means that there is an association between our response variable, the COVID-19 diagnostic rate, and these variables. We then make diagnostic plots of the lm1 model. In the residuals vs. fitted plot, there is no obvious relationship between the residuals and the fitted values, which is good. The spread of the points is also approximately equal around the horizontal line, so we can assume homogeneity of variance. Since most of the points fall on the normal Q-Q line, we can assume normality.
##
## Call:
## lm(formula = inferate^(1/4) ~ kidneys + smoke + diabetes + TB +
## Obesity + DevelopedCT + I(diabetes^2) + I(Obesity^2), data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81364 -0.14044 0.05121 0.14880 0.55247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1612071 0.2239492 0.720 0.47330
## kidneys -0.0122720 0.0151639 -0.809 0.42027
## smoke 0.0047077 0.0030517 1.543 0.12608
## diabetes 0.0679543 0.0251013 2.707 0.00798 **
## TB -0.0765644 0.2248631 -0.340 0.73420
## Obesity 0.0526435 0.0124774 4.219 5.4e-05 ***
## DevelopedCTYES 0.2012663 0.0730277 2.756 0.00695 **
## I(diabetes^2) -0.0034009 0.0012045 -2.824 0.00573 **
## I(Obesity^2) -0.0008944 0.0003453 -2.590 0.01102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.252 on 100 degrees of freedom
## Multiple R-squared: 0.6713, Adjusted R-squared: 0.645
## F-statistic: 25.53 on 8 and 100 DF, p-value: < 2.2e-16
To assess the causal effect of our treatment variable, we need to reduce the selection bias in our model, which we do using propensity scores. The summary output of the new, weighted model is shown below. From the output, the treatment indicator DevelopedCTYES is significant at the 0.001 level, so we conclude that there may be a causal relationship between the COVID-19 diagnostic rate and the development status of the country. We also ran model diagnostics: the normal Q-Q plot improves and the residuals vs. fitted plot is acceptable. Therefore the new model approximately satisfies the normality and equal variance assumptions.
##
## Call:
## lm(formula = inferate^(1/4) ~ kidneys + smoke + diabetes + TB +
## Obesity + DevelopedCT + I(diabetes^2) + I(Obesity^2), data = mydata,
## weights = weight)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -1.1445 -0.1222 0.0782 0.1945 0.6217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.272924 0.203909 1.34 0.18378
## kidneys -0.014627 0.013862 -1.06 0.29389
## smoke 0.003723 0.002792 1.33 0.18552
## diabetes 0.063515 0.025188 2.52 0.01326 *
## TB -0.064587 0.249069 -0.26 0.79593
## Obesity 0.045969 0.012183 3.77 0.00027 ***
## DevelopedCTYES 0.184412 0.050358 3.66 0.00040 ***
## I(diabetes^2) -0.003551 0.001194 -2.97 0.00368 **
## I(Obesity^2) -0.000699 0.000316 -2.21 0.02929 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.301 on 100 degrees of freedom
## Multiple R-squared: 0.639, Adjusted R-squared: 0.61
## F-statistic: 22.1 on 8 and 100 DF, p-value: <2e-16
Moreover, we think that diabetes and obesity may differ between the control group (DevelopedCT = no) and the treatment group (DevelopedCT = yes), so we performed a balance analysis to assess the degree of imbalance between the treatment and control groups before and after propensity score weighting. The comparison is shown in the table below. From the table, the selection bias from kidneys, smoke, and Obesity appears to be effectively removed after weighting. The imbalance in TB is reduced but remains significant (p = 0.0023), and diabetes, which was balanced before weighting, becomes imbalanced afterwards.
| Variable | p-value (unweighted) | p-value (propensity score weighted) |
|---|---|---|
| kidneys | 5.468263e-12 | 0.9749 |
| smoke | 0.001443113 | 0.38693 |
| diabetes | 0.645696 | 1.910052e-06 |
| TB | 5.823153e-08 | 0.002348 |
| Obesity | 4.846996e-10 | 0.1996 |
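The balance p-values can be read as the p-value of the treatment coefficient when each covariate is regressed on the treatment indicator, with and without weights. The sketch below illustrates this on simulated data. The helper `balance_p` is hypothetical, and the weights are random placeholders rather than real propensity score weights.

```r
set.seed(5)
n <- 109
# Simulated stand-in data with a built-in imbalance in Obesity between groups.
mydata <- data.frame(DevelopedCT = factor(sample(c("NO", "YES"), n, replace = TRUE)))
mydata$Obesity <- rnorm(n, mean = ifelse(mydata$DevelopedCT == "YES", 25, 15), sd = 5)
mydata$weight  <- runif(n, 1, 3)  # placeholder; real weights come from a propensity model

# Hypothetical helper: p-value of the treatment coefficient in covariate ~ treatment.
balance_p <- function(var, data, weighted = FALSE) {
  f <- reformulate("DevelopedCT", response = var)
  fit <- if (weighted) lm(f, data = data, weights = weight) else lm(f, data = data)
  summary(fit)$coefficients["DevelopedCTYES", "Pr(>|t|)"]
}

c(unweighted = balance_p("Obesity", mydata),
  weighted   = balance_p("Obesity", mydata, weighted = TRUE))
```

A small p-value indicates that the covariate still differs between the two groups. Standardized mean differences are a frequently used alternative that does not depend on sample size.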
When we fit our model, the data suggest that the total diagnostic rate of COVID-19 in each country is significantly associated with the obesity rate (p = 0.00005) and the diabetes rate (p = 0.00798). The linear coefficients of obesity and diabetes are both positive, while the coefficients of their squared terms are negative, meaning that the COVID-19 diagnostic rate rises with these rates at low values and the increase levels off at higher values.
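Because the model includes squared terms for diabetes and obesity, the fitted effect of each is a parabola on the fourth-root scale. As a small worked example using the lm1 estimates reported in the output, the value at which each fitted curve peaks is \(-\beta_{linear} / (2\beta_{squared})\). The units are those of the covariates in the data set, which we assume are percentages.

```r
# Coefficients copied from the lm1 summary output in this report.
b_diabetes <- 0.0679543; b_diabetes2 <- -0.0034009
b_obesity  <- 0.0526435; b_obesity2  <- -0.0008944

# A quadratic b1*x + b2*x^2 with b2 < 0 peaks at x = -b1 / (2*b2).
peak_diabetes <- -b_diabetes / (2 * b_diabetes2)
peak_obesity  <- -b_obesity  / (2 * b_obesity2)

round(c(diabetes = peak_diabetes, obesity = peak_obesity), 2)
```

The fitted diabetes curve turns at a rate of about 10 and the obesity curve at about 29, which agrees with the box plots, where cases first rise and then fall as the diabetes rate increases.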
In our balance analysis, the country type is significantly related to CKD, smoking, TB, and obesity before weighting. To discuss causal inference appropriately with our fitted model, we need to reduce the selection bias that may be caused by factors such as obesity, so we use the estimated propensity score. This is because the obesity rate may be a confounder: it may be related both to the type of country and to the COVID-19 diagnostic rate.
After using the propensity score to weight the data, we find that whether a country is developed is significantly related to the COVID-19 diagnostic rate, with p-value 0.0004 and coefficient 0.1844. Therefore, we conclude that there is a causal relationship between the two variables, and we expect a higher average COVID-19 diagnostic rate in developed countries than in developing countries.
We think there are many possible reasons for a causal link between whether a country is developed and its COVID-19 diagnostic rate. For example, developed countries are likely to have higher population mobility, or people in developed countries may place more weight on personal liberty and so be less willing to wear a mask.
One downside of our approach is that after applying the propensity score, the p-value of the regression coefficient between country type and diabetes changes from 0.64 to \(1.91 \times 10^{-6}\). Diabetes was not significantly related to country type at the 0.05 level before weighting, but it is strongly imbalanced afterwards, so the covariate diabetes may increase the selection bias in our causal analysis. Other methods may therefore be better suited to these data.
Tracing how a virus spreads is a core topic of epidemiology. Here, we conclude that the diagnostic rate of COVID-19 is influenced by the development status of the country. Exploring and verifying the mechanism behind this relationship is a good direction for further study.
[1] WHO Coronavirus (COVID-19) Dashboard. (2021). WHO. https://covid19.who.int/
[2] Global, regional, and national burden of chronic kidney disease, 1990–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30045-3/fulltext
[3] Estimate of current tobacco use prevalence (%). (2020). WHO. https://www.who.int/data/gho/data/indicators/indicator-details/GHO/gho-tobacco-control-monitor-current-tobaccouse-tobaccosmoking-cigarrettesmoking-agestd-tobagestdcurr
[4] Countries ranked by Diabetes prevalence (% of population ages 20 to 79). (2019). Indexmundi. https://www.indexmundi.com/facts/indicators/SH.STA.DIAB.ZS/rankings
[5] Prevalence of obesity among adults, BMI ≥ 30 (crude estimate) (%). WHO. Retrieved March 05, 2021, from https://www.who.int/data/gho/data/indicators/indicator-details/GHO/prevalence-of-obesity-among-adults-bmi-=-30-(crude-estimate)-(-)
[6] Human development reports. (n.d.). Retrieved March 05, 2021, from http://hdr.undp.org/en/indicators/137506